**************************
*NES emulation discussion*
**************************

Brad Taylor (big_time_software@hotmail.com)

3rd release: September 6th, 2003

This document is designed not only to be a guide to programmers who are writing an NES emulator, but also for those individuals who have some concern for efficiency and/or emulator throughput (or want to improve these things). Granted, with today's several-gigahertz processors (or otherwise, super-fast computers), general concern for efficiency seems to be going the way of the dodo (i.e., extinct). However, writing an emulator is by no means a trivial task. Even seasoned programmers will find that writing a blazingly fast NES emulator is pretty difficult, simply because of the way NES games use the hardware.

My opinion of a blazingly fast emulator would have to be one which can run at full speed on a 486DX2/66 computer. Unfortunately, this level of optimization goes far beyond the extents that most programmers are willing to go to (i.e., hand-coding assembly subroutines) to engineer something as complex as an emulator, and this guide doesn't expect the coder to be this dedicated to efficiency, anyway. Instead, this guide gives hints and tips on general optimizations, but also points out further optimizations possible for assembly and low-level coders. Additionally, it gears toward optimizations specifically for modern generic personal computers (PCs), so keep this in mind if you're writing your emulator for a different type of computer.

This document has undergone a major overhaul since the second release. The reason it has been reworked is that I have learned a great deal more about coding efficiency on x86-based platforms. Compared to this version, the older version of this document, with some of its optimization "tips", is questionable at best. The featured addition to this document is an MMX-coded example of the heart of a very fast playfield renderer.

topics discussed
----------------

	General PPU emulation
	Pixel rendering techniques
	Merging playfield & object pixels
	Frame store optimizations
	CPU emulation

references to other works of mine related to this document
----------------------------------------------------------

* 2C02 technical operation *
discusses the low-level operation of the PPU in great detail.

* PPU addressing *
a visual & written documentation of the operation of the PPU's scroll counters.

All documents can be found at http://nesdev.parodius.com.

+---------------------+
|General PPU emulation|
+---------------------+

Most likely, the key to your emulator's performance will be the speed at which it can render NES graphics. It's pretty easy to write a slow PPU render engine, since overall there's a good deal of work that has to be done. However, emulating the NES's PPU is not very difficult. What makes accurate emulation of it difficult is the trickery various NES games use to achieve special video effects (like split-screen scrolling) otherwise not possible by "clean" or conventional means. In reality, all these "tricks" are simply accomplished by writing to the appropriate PPU (or related) registers at the right time during the rendering of a frame (picture).

On a hardware level, the CPU & PPU in the NES run simultaneously. This is why a game can be coded to make a write out to a PPU register at a certain time during a frame, with the result that the (on-screen) effect occurs at a specific location on the screen.
Thus, the first instinct one has when writing an NES emulator is to execute the CPU & PPU engines alternately on every (NES) clock cycle. Doing this will give very accurate emulation, but it will also be VERY processor intense (since on a hardware level, the PPU does much more work per clock cycle). As a result, emulators coded this way turn out to be the slowest ones.

PPU info
--------

NES graphics consist of a single scrollable playfield, and 64 objects/sprites. The screen resolution is 256*240 pixels, and while games can control the graphics on a per-pixel basis, it is usually avoided since it's pretty difficult. Instead, the PPU makes displaying graphics easier for the programmer by dividing the screen up into tiles, which index an 8*8 pixel bitmap to appear in that particular spot. Each object defines 1 or 2 tiles to be displayed at a randomly accessible x/y coordinate on the screen.

There are also 2*4 color palettes in the PPU that bitmap data can refer to (playfield & object bitmap data each have their own dedicated palettes). Each palette has 3 indexable colors, as tile bitmaps only consist of 2 bits per pixel (the 00 combination is considered transparency). A single transparency color palette register is also defined, and is only used as the on-screen color when overlapping pixels (due to objects being placed on the playfield) of all playfield/object pixels are defined as transparent.

As graphics are rendered (as described in the "2C02 technical operation" document), the name tables are accessed sequentially to reference a tile's bitmap (as described in the "PPU addressing" document), which gets used as the pixel data for the area of the screen that the name table index entry corresponds to (plus the scroll register values). Attribute tables, which are laid out the same way that name tables are (except with lower resolution- 1 attribute entry represents a 2*2 cluster of on-screen tiles), determine the palette select value for the group of tiles to use (1 of 4). Object attribute memory (sprite RAM, or "OAM", which contains private tile index and palette select information) is evaluated every single scanline (y-coordinate entries are examined), and in-range objects have their tile bitmaps loaded into the PPU in between scanlines. The contents are then merged with the playfield's pixel data in real time.

Accurate & efficient PPU emulation
----------------------------------

For the most part, PPU operations are linear and sequential, making them easy to design algorithms for. By rendering playfields and objects on a per-tile basis, emulation can be made quick & easy. However, games that use special effects (mid-frame PPU trickery) require special handling, which in turn complicates algorithm design.

By implementing a clock cycle counter in the CPU core, it is possible for emulated PPU hardware to know exactly when a read/write is occurring to a PPU-related register (or otherwise, a register which is going to change the rendering of graphics from that point on). Therefore, when a write to a PPU register occurs, the PPU engine can determine if the write is going to change the way the picture is going to be rendered, and at the exact clock cycle (which really translates into a screen position).
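To make this concrete, here's a minimal C sketch of one way to organize it: the CPU core timestamps each PPU register write into a per-frame list, and the render routine catches the picture up to each write before applying it. render_up_to(), apply_ppu_write() and the cycles-per-frame constant are placeholders, not fixed names.

####################################
#include <stdint.h>

#define MAX_EVENTS       1024
#define CPU_CC_PER_FRAME 29781u  /* approximate NTSC CPU cycles per frame */

typedef struct {
    uint32_t cpu_cc;  /* CPU clock cycle of the write, relative to last VINT */
    uint16_t addr;    /* PPU-related register written */
    uint8_t  value;   /* value written */
} PpuEvent;

static PpuEvent events[MAX_EVENTS];
static int      event_count;

/* placeholder helpers implemented elsewhere in the emulator */
void render_up_to(uint32_t cpu_cc);              /* renders pixels up to this CC */
void apply_ppu_write(uint16_t addr, uint8_t v);  /* updates emulated PPU state   */

/* called by the CPU core whenever a PPU-mapped register is written */
void log_ppu_write(uint32_t cpu_cc, uint16_t addr, uint8_t value)
{
    if (event_count < MAX_EVENTS)
        events[event_count++] = (PpuEvent){ cpu_cc, addr, value };
}

/* called once per frame: render up to each logged write, apply it, continue */
void render_frame(void)
{
    for (int i = 0; i < event_count; i++) {
        render_up_to(events[i].cpu_cc);
        apply_ppu_write(events[i].addr, events[i].value);
    }
    render_up_to(CPU_CC_PER_FRAME);  /* finish the remainder of the image */
    event_count = 0;
}
####################################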
For example, say the CPU engine is executing instructions. Then, on clock cycle 13000 (relative to the last VINT), a write to the PPU's scroll registers is made (which causes a split-screen effect). First, the PPU translates 13000 CCs into X/Y coordinates (in this case, on-screen scanline 93, roughly pixel #126; the equations to do these calculations will be revealed later). Ideally*, all pixels before this point will now be rendered to a buffer, using the data in the PPU registers prior to the write. Now the screen area before the write occurred has been rendered accurately, and the screen will progressively continue to be updated in this fashion as more mid-frame writes occur. If no more occur, when the CPU arrives at the # of clock cycles per frame, the rest of the image (if any) can be rendered.

* As will be discussed in the following "the virtual frame buffer" section, maintaining a "stack", or list, of mid-frame PPU changes (which affect how successive rendering in the frame occurs), and only executing a PPU render routine once per frame (which then processes the stack of mid-frame writes), is a more efficient way of organizing CPU/PPU program control in your emulator.

Knowing when to update the screen
---------------------------------

The following list describes PPU status registers/bits that, if a game modified/changed them mid-frame, would change the way the rest of the frame is rendered. O = update objects, P = update playfield.

O	object enable bit
O	left column objects clipping
O	8/16 scanline objects
O	active object pattern table
O	pattern table bankswitch (which affects the active object pattern table)
PO	color emphasis bits
PO	black & white/color select
P	playfield enable bit
P	left column playfield clipping
P	scroll registers
P	X/Y name table selection
P	name table bankswitch (hypothetical)
P	active playfield pattern table
P	pattern table bankswitch (which affects the active playfield pattern table)

Note that any PPU-mapped memory (which means name, pattern, color & palette tables) can only be changed while objects & the playfield are disabled (unless cartridge hardware provides a way to do this through the CPU memory map). Since the screen is blanked to black during this time (regardless of the current transparency color the palette is programmed with), these writes do not affect how the screen is rendered.

Collision flag
--------------

Games without hardware for scanline counting often poll this bit to find out when to make a write out to a PPU register which will result in a split screen, or a pattern table swap/bankswitch. The collision flag is set when the first non-transparent pixel of object 0 collides with a playfield pixel that is also non-transparent. Since the screen position of the first colliding pixel can be determined at any time (and therefore, the exact CPU clock cycle at which the collision is expected to occur), when a game requests the status of this flag for the first time, a routine in the PPU engine can calculate at which clock cycle this flag will be set (calculations will be shown later). Subsequent requests for the collision flag's status would then only require the engine to compare the current CPU clock cycle to the calculated collision clock cycle. Whenever a mid-frame change occurs (whether it affects the playfield or objects), the clock cycle at which the collision flag will go off will have to be recalculated (unless it has already gone off).
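A sketch of this comparison-based scheme in C, where find_first_collision_cc() stands in for the actual object-0-versus-playfield scan:

####################################
#include <stdint.h>
#include <stdbool.h>

static bool     collision_known;  /* has the collision CC been calculated yet? */
static uint32_t collision_cc;     /* CPU CC at which the flag sets             */

uint32_t find_first_collision_cc(void);  /* placeholder: scans OBJ 0 vs. playfield */

/* first request performs the expensive scan; later requests are one compare */
bool read_collision_flag(uint32_t cpu_cc)
{
    if (!collision_known) {
        collision_cc = find_first_collision_cc();
        collision_known = true;
    }
    return cpu_cc >= collision_cc;
}

/* a mid-frame PF/OBJ change invalidates the cached value, unless the flag
   has already gone off by the time of the change */
void note_midframe_change(uint32_t cpu_cc)
{
    if (collision_known && cpu_cc < collision_cc)
        collision_known = false;
}
####################################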
MMC3 IRQ timer
--------------

The MMC3's IRQ timer relies on the toggling of one of the PPU's address lines. Basically, its counting operation proceeds at a more or less constant rate (meaning it is predictable). However, when the PPU bus is disabled (via disabling the playfield & objects, or during the V-blank period), the counter must quit counting. Manual toggling of PPU address bits during this time will have to be intercepted, and the IRQ timer advanced appropriately.

CPUCC to X/Y coordinate equations
---------------------------------

The PPU renders 3 pixels in one CPU clock. Therefore, by multiplying the CPU CC figure by 3, we get the total amount of pixels that have been rendered (including non-displayed ones) since the VINT. 341 pixels are rendered per scanline (although only 256 are displayed). Therefore, by dividing PPUCC by this, we get the # of completely rendered scanlines since the VINT. 21 blank scanlines are rendered before the first visible one is displayed. So, to get a scanline offset into the actual on-screen image, we simply subtract the number of non-displayed scanlines. Note that if this yields a negative number, the PPU is still in the V-blank period.

PPUCC		= CPUCC * 3
Scanline	= PPUCC div 341 - 21;	Y-coordinate
PixelOfs	= PPUCC mod 341;	X-coordinate

CPUcollisionCC	= ((Y+21)*341+X)/3

Note that if the PixelOfs equation yields a number higher than 255, the PPU is in the H-blank period.
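In C, the equations translate directly (a small sketch; signed results are used so the V-blank and H-blank conditions noted above remain visible):

####################################
#include <stdint.h>

typedef struct { int32_t scanline; int32_t pixel; } XY;

/* CPU clock cycle (relative to the last VINT) -> screen position */
XY cpucc_to_xy(uint32_t cpu_cc)
{
    uint32_t ppu_cc = cpu_cc * 3;               /* 3 PPU clocks per CPU clock   */
    XY xy;
    xy.scanline = (int32_t)(ppu_cc / 341) - 21; /* negative => still in V-blank */
    xy.pixel    = (int32_t)(ppu_cc % 341);      /* > 255    => in H-blank       */
    return xy;
}

/* screen position -> CPU clock cycle (e.g. the expected collision CC) */
uint32_t xy_to_cpucc(uint32_t x, uint32_t y)
{
    return ((y + 21) * 341 + x) / 3;
}
####################################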
+--------------------------+
|Pixel rendering techniques|
+--------------------------+

3 rendering techniques are described in this section. They are all real-time techniques. An earlier, unreleased draft of this document discussed a tile caching-based rendering solution. However, tile caching quickly loses its effectiveness for those games that use mid-frame (or even mid-scanline) trickery to change character sets, or even palette values. Instead, this release contains a better & faster method of rendering using MMX instructions, which is why I personally recommend this solution over any other one discussed here.

Basic
-----

This method, which is the most straightforward, is to store the PPU's 52-color matrix as constant data in the VGA palette registers (or otherwise, other palette registers used for an 8-bit per pixel graphics mode). Before a pixel can be drawn, the pixel color is calculated (via pattern table & palette select data). The PPU palette registers are looked up in some way or another, and the contents of the palette register element are written to a virtual frame buffer as the pixel data. This technique is the easiest to implement, and provides the most accurate PPU emulation. However, since every pixel drawn requires an independent palette look-up, this method is naturally very slow.

One way to speed up this rendering style is to create a palette table designed for looking up 2 or more pixels simultaneously. The advantages are clear: you could easily shave a lot of time (close to half, with a 2-simultaneous-color lookup) off playfield rendering. The disadvantages are that the lookup table grows from 2^4*2=32 bytes for a 2-pixel lookup, to 2^8*4=1024 bytes for a 4-pixel lookup (and 1024 bytes makes this table not suitable for staying in the CPU cache of an older CPU). Each of the palette's 4 colors is also mirrored across these tables, and this has to be maintained. Since I've never tried this optimization technique, I can't tell you how effective it is (or when it stops being effective).

Another way to increase the speed of this approach is to change the bit ordering of the pattern tables stored in memory to favor this rendering algorithm. For example, store the bitmap for any scanline of a tile as 8 packed 2-bit pixels, instead of the default format of 2 separate 8-bit planes. By doing this, it will allow the tile rendering routine to easily extract the 2-bit number for indexing the 4-color palette associated with the particular tile. Of course, by changing the pattern tables, whenever pattern table memory is read or written, the format of the data will have to be converted. Since this happens much less often (even in games that use CHR-RAM), it's a good idea.
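A sketch of the 2-pixel simultaneous lookup described above, assuming the pattern data has already been repacked into 2-bit pixels with the leftmost pixel in the most significant bits (the table layout and names here are illustrative):

####################################
#include <stdint.h>
#include <string.h>

static uint16_t pair_lut[16];  /* 2^4 entries * 2 bytes = 32 bytes total */

/* rebuild whenever the 4-color palette behind it changes (mid-frame too) */
void rebuild_pair_lut(const uint8_t pal[4])
{
    for (int i = 0; i < 16; i++) {
        uint8_t left  = pal[(i >> 2) & 3];    /* high 2 bits = left pixel  */
        uint8_t right = pal[i & 3];           /* low 2 bits  = right pixel */
        pair_lut[i] = (uint16_t)(left | (right << 8));  /* little-endian   */
    }
}

/* draw 8 packed 2-bit pixels as four 2-pixel (16-bit) stores */
void draw8(uint8_t *dst, uint16_t packed)
{
    for (int i = 0; i < 4; i++) {
        uint16_t two = pair_lut[(packed >> (12 - i * 4)) & 15];
        memcpy(dst + i * 2, &two, 2);  /* compiles to a single 16-bit store */
    }
}
####################################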
VGA palette register indexed
----------------------------

This method involves programming the VGA palette registers to reflect the direct values the PPU palette registers contain. The VGA palette would be divided into 64- 4 color palettes. When sequential horizontal pixels are to be drawn, a large (32-bit or more) pattern table data fetch can occur (pixels for pattern table tiles should be organized in memory so that the 2 bits for each horizontally sequential pixel are stored in 8-bit increments). This chunk of fetched pixel data can then be masked (so that other pixel data from the chunk is not used), an indexed "VGA palette select" value can be added to it, and the result can then be written out to the virtual frame buffer in one single store operation.

The "VGA palette select" value is fetched via the VGA palette select table, which corresponds to the 8 classic PPU palettes (4*2 elements in the table; therefore, a tile's attribute data (either PF or OBJ) is used as the index into this table). This table indicates which 4-color group of the 64 groups in the VGA palette to use for color selection for the group of pixels being written. The idea is that when a mid-frame palette change occurs (or at any time, for that matter), the affected PPU palette in this table is changed to point to where the new palette modifications will be made in the VGA's palette. The corresponding VGA palette entries will also have to be updated appropriately (generally, VGA palette updates will be made in a ring-buffer fashion: a pointer which keeps track of the first available 4 palette entries will be incremented whenever any entries in a 4-color PPU palette are changed).

Basically, this method offers the fastest possible way to render NES graphics, since data is fetched from pattern table memory and written directly to the virtual frame buffer. The number of pixels processed simultaneously can be as high as 8 (with 64-bit integer instructions). However, the # of mid-screen PPU palette modifications possible is limited to 64 (or 32 for PF and 32 for OBJs, if one of the bits in every pixel needs to be used to distinguish a playfield pixel from an object pixel), but multiple consecutive modifications to a single 4-color PPU palette only count as one actual modification.

MMX instruction-based rendering
-------------------------------

With the introduction of the Pentium MMX, the entire x86 architecture not only got a lot more complex, it also got a lot better. MMX instructions are a set of single-instruction, multiple-data, RISC-like instructions. Nearly all the instructions have a very low 1 or 2 clock cycle latency across all x86-class CPUs which support them, hence these instructions are very desirable to use. The instructions work around an 8-element, 64-bit flat register file, which overlaps the FPU's mantissa registers. The BIG deal about using MMX instructions for pixel rendering is that 8 pixels can be operated on simultaneously, provided each pixel is no larger than a byte.

The following assembly-written routine can fully calculate pixel color for 8 horizontally sequential pixels on every loop iteration (the example actually renders the first 4 scanlines of a tile). The pattern tables have already been reorganized so that the bitmap data for 4 scanlines of tile data can be loaded into an MMX register, and used in the most efficient way possible. Pixel data for 4 sequential scanlines under the same horizontal coordinate is stored in a single byte, with the 2 MSBs containing the lowest logical scanline coordinate. Sequential bytes, up to the 8th one, contain the pixel data for every successive horizontal position. Finally, the first 8 bytes of a tile's pattern table data contain the full bitmap data for the first 4 scanlines of the tile, and the next 8 bytes contain the last 4 scanlines.

####################################
;register assignments
;--------------------
;EAX: destination pixel pointer
;EBX: points to the palette to be used for this tile (essentially determined by the attribute table)
;EDX: points to a 32-byte (4 tile scanline) buffer for implementing fine horizontal scroll effects
;ESI: source pointer for 32-pixel bitmap to be loaded from pattern table
;MM4: (8 - fine horizontal scroll value)*8
;MM5: (    fine horizontal scroll value)*8

;fetch 32 pixels from pattern table, organized as 8 horizontal x 4 vertical.
	movq	mm3,[esi]
	mov	ecx,-4		;load negative loop count

;move constants related to color calculation directly into registers. These have to be stored in memory, since MMX instructions don't allow the use of immediate data as an operand.
@1:	movq	mm0,_BFx8	;contains 0BFBFBFBFBFBFBFBFh
	movq	mm1,_FFx8	;contains 0FFFFFFFFFFFFFFFFh
	movq	mm2,_3Fx8	;contains 03F3F3F3F3F3F3F3Fh

;generate masks depending on the magnitude of the 2 MSBs in each packed byte (note that this is a signed comparison).
	pcmpgtb	mm0,mm3
	pcmpgtb	mm1,mm3
	pcmpgtb	mm2,mm3
	psllq	mm3,2		;shift bitmap to access next scanline of pixels

;to perform color lookup, a precalculated palette table is used & ANDed with the resulting masks of the last operation. Since XOR operations are used to combine the results, this requires the elements in the palette table to be XORed with adjacent values, so that they'll be cancelled out at the end of the logic processing here. The required precalculated XOR combination of each color element is shown in the comments below by the corresponding element. Note that each lookup is 8 bytes wide; this requires the same palette data for a single element to be mirrored across all 8 sequential bytes.
	pand	mm0,[ebx+00]	;2^1^0^3
	pand	mm1,[ebx+08]	;2^1^0
	pand	mm2,[ebx+16]	;2^1
	pxor	mm0,[ebx+24]	;2
	pxor	mm1,mm2

;this logic performs shift functionality, in order to implement fine horizontal scrolling. The alternative to this is simply writing 64 bits out to unaligned addresses for fine H scroll values other than zero, but since this can incur large penalties on Intel-based processors, this is generally the preferred way to generate the fine horizontal scroll effect.
	movq	mm2,[edx+ecx*8+32]
	pxor	mm0,mm1
	psrlq	mm2,mm5
	movq	[edx+ecx*8+32],mm0
	psllq	mm0,mm4
	por	mm0,mm2
	movq	[eax],mm0

;loop maintenance
	add	eax,LineLen	;advance pixel pointer to next scanline position
	inc	ecx
	jnz	@1
####################################

Obviously, code will have to be added here to compensate for games that may require scanline-granular PPU rendering, but this won't change the # of instructions in the inner loop. At any rate, the inner loop can execute in just about 14 clock cycles on an Athlon, and that translates into 14/8 = 1.75 clock cycles per fully rendered pixel.
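For reference, here is one way (a hedged C sketch) to repack a standard 16-byte planar NES tile into the 4-scanlines-per-byte format the routine above expects. Whether bitplane 1 occupies the higher bit of each 2-bit pixel is an assumption that must agree with the palette table layout:

####################################
#include <stdint.h>

/* src: 16-byte planar tile (plane 0 then plane 1, bit 7 = leftmost pixel)
   dst: 16-byte repacked tile (8 bytes per 4-scanline group, topmost
        scanline of each group in the 2 MSBs of every byte)              */
void repack_tile(const uint8_t src[16], uint8_t dst[16])
{
    for (int group = 0; group < 2; group++) {       /* scanlines 0-3, 4-7 */
        for (int x = 0; x < 8; x++) {               /* horizontal position */
            uint8_t b = 0;
            for (int y = 0; y < 4; y++) {
                int line = group * 4 + y;
                int p0 = (src[line]     >> (7 - x)) & 1;  /* bitplane 0 */
                int p1 = (src[line + 8] >> (7 - x)) & 1;  /* bitplane 1 */
                b |= (uint8_t)(((p1 << 1) | p0) << (6 - y * 2));
            }
            dst[group * 8 + x] = b;
        }
    }
}
####################################

Since pattern memory is read or written far less often than pixels are rendered (even with CHR-RAM games), this conversion cost is paid rarely, as noted earlier.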
+---------------------------------+
|Merging playfield & object pixels|
+---------------------------------+

The most efficient way to combine playfield & object data into your final rendered frame is to always render the playfield first (or a section of it, in the case of dealing with a split screen), directly to the image buffer itself. At this point, to effectively merge object pixels with the playfield's, each pixel in your image buffer must have an extra 2 bits associated with it: one represents the transparency status of the playfield pixel, and the other does the same for an object pixel (when drawn later). Naturally, after rendering the playfield, the image buffer won't have any pixels with the transparency status for object pixels marked as false. But now, as objects are rendered, the condition on which an actual pixel is drawn depends on these two transparency status bits, the object's own transparency status, and its priority. Starting in order from object 0 (highest priority) up to 63, object bitmaps are "merged" with the playfield, in the fashion that the following few lines of pseudo-code show:

	IF (SrcOBJpixel.xpCond = FALSE) THEN
		IF ((DestPixel.OBJxpCond = TRUE) AND
		    ((DestPixel.PFxpCond = TRUE) OR (SrcOBJpixel.Pri = foreground))) THEN
			DestPixel.data := SrcOBJpixel.data
		FI
		DestPixel.OBJxpCond := FALSE
	FI

So, as you can see, the destination's OBJxpCond is marked as false even if the object's pixel is not meant to be drawn. This is to prevent the pixels of lower-priority (numerically higher-numbered) objects from being drawn in those locations.

This may raise the question, "Why do you render objects in the order 0->63 (effectively requiring 2 bits for transparency status), when you could render them in the opposite direction (which only requires 1 bit for transparency status)?" The answer is because of what happens on a priority clash (see the "PPU pixel priority quirk" section of the "2C02 technical operation" document). Rendering objects in order 0->63 is the only way to emulate this PPU feature properly (and some games DO depend on the functionality of this, as it provides a way to force the playfield to hide foreground-priority object pixels). Otherwise (for 63->0), it would be necessary to merge objects into an image buffer filled with the current transparency color, and then merge playfield data with the buffer as well. Granted, this technique only requires 1 transparency (background priority) status bit per pixel, but since merge operations are slow, and this technique requires many more of them, it is inferior to the aforementioned one.
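Here is the pseudo-code expressed as a C sketch, assuming the two status bits live in the top bits of each frame-buffer byte (the storage trick mentioned under "other tips" below), with the pixel's color confined to the low 6 bits:

####################################
#include <stdint.h>
#include <stdbool.h>

#define PF_XP  0x80  /* set = playfield pixel here is transparent        */
#define OBJ_XP 0x40  /* set = no object pixel has claimed this spot yet  */

/* obj_color is assumed < 0x40 (the 52 NES colors fit in 6 bits) */
void merge_obj_pixel(uint8_t *dest, uint8_t obj_color,
                     bool obj_transparent, bool foreground_priority)
{
    if (!obj_transparent) {
        if ((*dest & OBJ_XP) &&                   /* no higher-priority OBJ yet */
            ((*dest & PF_XP) || foreground_priority))
            *dest = (*dest & PF_XP) | obj_color;  /* draw; keep PF status bit   */
        /* claim the spot even when the pixel isn't drawn, so lower-priority
           objects can't show through (the priority-clash behaviour)            */
        *dest &= (uint8_t)~OBJ_XP;
    }
}
####################################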
other tips
----------

- Depending on your implementation of pixel rendering, you may be able to store the 2 transparency status bits inside the pixel data itself. For example, if only 52 combinations of a rendered pixel are being generated, the upper 2 bits of the pixel's byte can be used for this storage. This may mean that you'll have to mirror your video buffer's palette register RGB information 4 times, but it is otherwise a good idea. For 8-bit color VGA modes, a legacy mask register (3C6h) allows the programmer to mask out any bits of the written pixel data that are unrelated to color generation.

- Don't use branching to avoid drawing a pixel out somewhere. First of all, it only allows you to process 1 pixel at a time, which is slow. Second, CPUs have a hard time predicting branches based on random data (or at minimum, data that produces a branch pattern which is too long to be stored in the CPU's branch target buffer). Finally, sequences of arithmetic and logical operations can be used to merge multiple bytes of data simultaneously (especially with MMX instructions).

- Inline the code in small loops with a constant # of iterations, especially if the loop count is low and the loop is the innermost one. This reduces overhead by avoiding a backwards branch, and the loop & index counters required. For example, when drawing tiles, it would be a good idea to inline the code to draw 8 horizontal pixels.
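For instance, a completely unrolled 8-pixel tile-scanline draw (a sketch, assuming the packed 2-bit pixel format suggested earlier, leftmost pixel in the two MSBs):

####################################
#include <stdint.h>

/* no inner loop counter, no backward branch: all 8 pixels inlined */
static inline void draw_tile_scanline(uint8_t *dst, uint16_t bits,
                                      const uint8_t pal[4])
{
    dst[0] = pal[(bits >> 14) & 3];
    dst[1] = pal[(bits >> 12) & 3];
    dst[2] = pal[(bits >> 10) & 3];
    dst[3] = pal[(bits >>  8) & 3];
    dst[4] = pal[(bits >>  6) & 3];
    dst[5] = pal[(bits >>  4) & 3];
    dst[6] = pal[(bits >>  2) & 3];
    dst[7] = pal[ bits        & 3];
}
####################################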
+-------------------------+
|frame store optimizations|
+-------------------------+

One of the simplest approaches to render emulation is to draw the entire playfield to the video buffer, and then place applicable object data in the buffer afterwards (this makes object/playfield pixel decisions easier to calculate, since the playfield pixels are already rendered). This straightforward approach would also be the most efficient way to deal with rendering NES graphics, were there not certain caveats to using the video buffer. Since I don't know how Windows or Linux programmers do this sort of thing, this topic may not be applicable to them.

- Video frame buffer reading is painfully slow. No matter how fast a video card a computer has, reading video memory is going to be at least 10 (ten!) times slower than writing to it. This would affect the time when objects are being rendered, when the contents of the playfield that underlap the object need to be read in & merged with the object's pixels.

- Writing to random places in video memory is painfully slow. Because modern I/O devices (PCI, AGP) in PCs share address lines with data lines (over the same bus), there's overhead in writing to a random place in video memory. This design was made with data streaming in mind, so that when sequential writing occurs (a pretty common thing), bus lines which would otherwise be wasted on keeping track of a sequentially increasing address can instead be used to carry data. So, a non-sequential transfer of data to the video card could take as much as double the amount of time that a sequential transfer does. This point alone makes rendering the playfield on a per-tile basis (where you're basically only storing 8 sequential pixels at a time for each scanline of the tile) directly to the video buffer one of the worst approaches. Sequential data transfers to the video card are close to optimal at 256 bytes at a time and up.

- Writing to random, unaligned places in video memory is incredibly slow. This is because a merge operation has to take place on the hardware level: a read from the video card (which is already slow), and then the write. However, this operation is only required at the beginning & end of an unaligned, sequential transfer. Thus, unaligned data streaming to the video card carries less of a penalty the larger the transfer is (I measured an 11% additional overhead on an unaligned sequential store of 512 bytes to the video card buffer, and double that for a 256-byte transfer). Video addresses which are divisible by 64 bytes are considered to be aligned.

- Writing to video memory in small (byte-by-byte) store operations is a bad idea. While modern PC CPU/chipset hardware may be able to detect & combine several little sequential stores to the video buffer into full-size ones, there's no guarantee this will happen. Older hardware certainly doesn't do this. And if the small writes aren't being combined together, then guess what? The chipset will perform a merge operation for each data item transferred that isn't full size. Obviously, this could be the worst possible way to send data to the video buffer (well, next to storing small data at random places in video memory).

- Writing to a non-linear frame buffer (LFB) is slow. At least on one card I tested, there was a 333% increase in video buffer write speed after switching from using the legacy one at address 000A0000. I understand that basically any PCI video card has LFB capabilities, but they may be inaccessible due to the card's BIOS, or drivers. I guess that this is really a responsibility of the OS, but either way: use the LFB any way you can.

the virtual frame buffer
------------------------

Now you should see that it's just not a good idea to render graphics directly to the video buffer (although I don't think anyone would do this, anyway). Prior versions of this document discussed using a virtual frame buffer, which was basically a buffer allocated in regular memory used to render graphics to (instead of directly to the video buffer). When the virtual buffer was full, it would then be copied to the video buffer in a large, sequential operation (just the way the video card likes it!). It embarrasses me to admit that this method is actually quite inefficient, and I'd like to clear the air on that issue. To understand why it's inefficient, let me explain a little about how CPU caches work.

If you know how virtual memory works, well, a CPU cache is basically a hardware version of this, although not for a disk drive, but for main system RAM. A CPU caches data on a line-by-line basis. Each line is anywhere from 16 (486) to 32 (Pentium) to 64 (Athlon) bytes in size, and will most likely grow bigger in future CPUs. So, if only a single byte needs to be read from memory, an entire line is actually loaded into the cache (this is why data alignment, and grouping related data fields together in records, is important to guarantee the effectiveness of a CPU's cache). This action also pushes another line out of the cache (ideally the least recently used one), and if it's dirty (modified), it will be written back to main memory.

486's started the on-chip CPU cache trend, with a whole 2K bytes for data and code. 486DX4 models had 8K bytes. Pentiums had 16K, etc. The point is, the size of the cache is basically the size of memory that the CPU can randomly access for next to no clock cycles at all (providing the target address is in there). Of course, this only applies to the L1 cache, since there's some latency involved with accessing the L2 cache, depending on its location (older CPUs never had on-chip L2 cache; it was located on the mainboard). The trick to effective use of the cache is all in how software is written. The best thing to do is to divide software program control up evenly between the individual tasks it performs, so that only a minimum number of cache misses occur when the task is switched, and completely different data areas are being accessed by the program.

the virtual frame buffer caveat
-------------------------------

Let's consider the virtual frame buffer (VFB) model. We start rendering our playfield. Name tables and pattern tables are accessed, and that's fine (the name tables are easily cached, and even some pattern table data gets cached). Then, we store our rendered pixel data to our buffer.
The pixel data is stored to the VFB using a data size of 4 bytes (or otherwise, the biggest size that the processor will allow the programmer to store with). However, the CPU's cache line size is always bigger than this, and therefore the CPU performs a merge operation between the written data and the cache line of the data being written to. Now, here's the first problem: the target of the store operation to the VFB is unlikely to be in the cache. This means that the CPU ends up actually *reading* main memory after your first 4-byte pixel store. Of course, now you can write to this line for free, but main memory access is slow, and considering what we're doing here (which is exclusively store operations), it's kind of ridiculous that the programmer has no way of telling the processor that the merge operation (and moreover the reading of main memory) is unnecessary, since we plan on overwriting all the original data in that particular line (newer x86 processors offer instructions for better management of non-temporal storage). Anyway, you get the idea: after every few stores to the VFB, a new line from the VFB will be read in.

But guess what? This isn't even the worst part of it. As you keep filling the VFB, your CPU's cache overflows, since your CPU's L1 cache is smaller than the VFB you're working on. This means that not only will your VFB rendering eventually push out any lines from the cache which aren't used directly by the render routine (causing lost cycles for even local routines that may need them immediately after the render), but after the render, when you go to copy the VFB to video memory, the entire buffer has to be loaded back into the CPU's cache. Of course, the size of the CPU cache is everything here. Currently however, to my knowledge, the largest L1 cache implemented in a CPU is 64K bytes, and that's only for the Athlon (the P4's is smaller than this). With that said, modern CPUs are simply unfit for this style of rendering.

scanline stores
---------------

By reducing the size of the VFB from full size down to a few scanlines (or even just one), most or all of the caveats mentioned above can be avoided. Since typically a VFB scanline is 256 bytes (in the example for the PPU), this makes the memory requirement small enough to ensure modest performance even on a 486. Of course, this creates a new problem for writing the PPU render engine- tiles can no longer be rendered completely (unless you're using an 8-scanline VFB, but the rest of this topic assumes you're using only a single-scanline VFB). Some overhead caused by only rendering a single scanline of a tile at a time can be avoided by precalculating pointer work for each sequential tile, and storing it in an array, so that the calculations can be reused for the tile's other scanlines. A similar technique can be used for object pointer calculations as well.

Perhaps a possible performance boost obtainable through a careful scanline rendering engine design is that storing rendered pixels directly to the video buffer may be permitted, since all pixels on the scanline are rendered sequentially (and thus can be stored out that way). However, there are conditions that determine the effectiveness of this. First, obviously this could only be done for the tiles on the scanline where objects are not placed; a temporary buffer in memory will still have to be used to make playfield/object decisions for the tiles that do have objects (not to mention, some method of determining the sequence in which objects appear on the scanline is needed).
The second condition depends on alignment. If the PPU's fine scroll offset is 0 (or even 4, if using 32-bit store operations), pixel data can be written out to the video buffer directly with no performance penalty (and also with no leading/trailing tile debris on the ends of the scanline written to video mem). If it is other than 0 or 4, it may be faster to store it unaligned to a scanline VFB, then copy it back out aligned. Another option is to perform byte shifting/rotating on the fetched bitmaps before storing directly to the video buffer (as is done with the MMX example shown above).
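Putting the section together, a hedged sketch of a single-scanline engine: per-tile pointer work is precalculated once per tile row, the line is built in a small VFB, and then sent to video memory as one aligned sequential store. All helper names are hypothetical.

####################################
#include <stdint.h>
#include <string.h>

#define TILES_PER_LINE 33              /* 32 visible tiles + 1 for fine X scroll */

typedef struct {
    const uint8_t *pattern;            /* tile bitmap base (precalculated)  */
    const uint8_t *palette;            /* 4-color palette for this tile     */
} TileRef;

/* hypothetical helpers implemented elsewhere */
void precalc_tile_row(int y, TileRef *row);   /* name/attribute table fetches */
void draw_tile_line(uint8_t *dst, const TileRef *t, int subline);
void merge_objects_on_line(int y, uint8_t *line);

static TileRef row[TILES_PER_LINE];
static uint8_t line_vfb[TILES_PER_LINE * 8];  /* one scanline + scroll slack */

void render_scanline(int y, uint8_t *video, int pitch, int fine_x)
{
    /* pointer work is reused for all 8 sublines of the tile row
       (fine Y scrolling is ignored in this sketch) */
    if ((y & 7) == 0)
        precalc_tile_row(y, row);

    for (int t = 0; t < TILES_PER_LINE; t++)
        draw_tile_line(line_vfb + t * 8, &row[t], y & 7);

    merge_objects_on_line(y, line_vfb + fine_x);

    /* one aligned, sequential 256-byte store: what the video card likes */
    memcpy(video + y * pitch, line_vfb + fine_x, 256);
}
####################################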
+-------------+
|CPU emulation|
+-------------+

6502 opcodes for a particular game must be processed, and a unique task carried out for each one (or at least, for the documented ones). This section discusses efficient implementation of a 6502 engine for the NES.

Basic optimizations
-------------------

- Interpreted (jump table-based) vs. dynamic recompilation (converting 6502 instruction sequences to x86 ones on the fly). Basically, dynamic recompilation has a lot of potential for generating optimal CPU emulation performance. Unfortunately however, these performance benefits can only be produced on modern computers, due to the large amount of data structures that need to be maintained to allow it to work (lots and lots of memory is required as well). On the other hand, jump table-based 6502 instruction processing is more favorable on slower machines, since even on a Pentium, an indirect jump only incurs a 5 clock cycle penalty, provided it's not predicted (which is usually the case; predicted jumps of any type are just one clock). By reducing the size of your processor core (by using branchless algorithms for 6502 effective address calculations, or in other places where it can be beneficial), your core would only need to do indirect jumping to handle the emulation of different instructions- which will only require a look-up table with about 20 or 30 branch targets.

- Weigh the pros and cons of writing a 6502 emulator with an exclusive routine for every instruction carefully. Exclusive routines for all 151 6502 opcodes only make the throughput part of instruction emulation quicker. However, due to the larger code size, it is more likely that performance will be lost on older computers, due to the core's resident size forcing other pieces of code out of the CPU cache.

- Implement a clock cycle counter in your 6502 engine, which will be maintained by every 6502 instruction executed. This counter will mainly be used by the PPU to figure out how timed writes will affect how the output image is rendered. However, if it is also used as a terminal counter, then when the count expires, program control can be transferred to a different part of the 6502 engine to emulate the behaviour of an event that would normally occur on the NES at that exact clock cycle (i.e., a timed IRQ event). Also, don't forget that you can manage any number of "virtual cycle counters" without ever having to make the CPU core maintain more than one physical one. NES hardware may have several IRQ-generating counters going simultaneously, but the order in which each will cause an IRQ is always known to the emulator, which is why the cycle count register only has to be programmed with the count value of the next IRQ to occur.

- As 6502 instructions usually require the P (processor status, or flags) register to be updated after an ALU operation, note that the x86's ALU instructions update its flags register in a similar manner. Therefore, after emulating the ALU behaviour of a 6502 instruction with an x86 one, use instructions like "LAHF" or "SETcc" to acquire the status of the sign, zero, carry, and overflow condition codes. Furthermore, have your emulator store the 6502 flags in the format that they're stored in on the x86 CPU. This way, the flags do not have to be reformatted, thus saving time. The only times the flags have to be converted to/from the 6502 order are when the 6502 instructions PHP, PLP, BRK #xx, RTI, and hardware interrupts are executed. Since these happen much less often than the more common arithmetic and logical instructions, it's more efficient to handle the flags this way.

- Use the x86's registers to store all the 6502's internal registers (including a cycle counter). There is enough room to leave 3 general-purpose registers for temporary use. However, don't go out of your way to do this; leaving some commonly used emulation variables in memory only incurs a small penalty, if any (usually a simple load from memory can execute in a single clock).

Advanced optimizations
----------------------

- Dealing with paged/bankswitched memory. Because the 6502's memory map is nonlinear, when the 6502 engine needs to read/write memory from/to a part of the NES's memory map, use part of the target address to index into a table of pointers. This table of pointers (hereafter, the memory map translation table) will represent the architecture of the NES memory map (there will have to be 2 separate tables of this type, for handling reads & writes from/to NES memory). Each entry in the table represents a linear page (or bank) in the NES's memory map. Generally, this table will have to have a page/bank granularity as small as 2K (i.e., 32 entries), to emulate the mirroring behaviour of the NES's internal RAM properly. A sketch of such a table appears below.

Reserve a single bit in each pointer entry to indicate whether the target page pointer points directly to a base memory address (in the x86 memory map), or to an x86 subroutine (used for emulating hardware behaviour when access to that address occurs). If the entry is a subroutine, then program control should be transferred to the pointer's address. Otherwise, no branch should occur, and the code that stores/loads the data via the pointer base (the original address will be used as the offset into the page) should be inlined within the opcode's code body. Also remember that 6502 opcodes that work directly with stack or zero-page memory can avoid doing a look-up into the memory map translation table, since these areas are hardwired to the NES's internal 2K RAM.

- Fetching instructions. Because the 6502 is always fetching instructions, and in a sequential manner, it is more efficient to maintain the 6502's PC as a 32-bit IPC register (Indexed PC) which points to the physical address at which the instructions reside. This avoids having to test the PC's address (with the memory map pointer table) every time an instruction byte needs to be fetched from memory.

- Handling program control transfers. When the IPC must be modified via an absolute program control transfer instruction, using the new 16-bit PC address, the memory map page table is consulted to load the new base address for IPC (this base address is also saved for later). The PC value is then added to IPC to offset it into the page. When a PC value must be saved to the stack, the saved base address (mentioned previously) is subtracted from IPC to calculate the current PC value. Note that support for executing 6502 instructions residing in a hardware-register-mapped part of memory will require additional code.
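A C equivalent of the read-side translation table (a sketch: the x86 version described above packs the subroutine flag into a reserved pointer bit and inlines the non-branching path into each opcode body; a struct with two fields is the portable stand-in here):

####################################
#include <stdint.h>

#define PAGE_SHIFT 11                       /* 2K granularity -> 32 entries */

typedef uint8_t (*ReadHandler)(uint16_t addr);

typedef struct {
    const uint8_t *base;     /* NULL if this page is hardware-mapped */
    ReadHandler    handler;  /* used when base == NULL               */
} ReadPage;

static ReadPage read_map[32];  /* a second, separate table handles writes */

uint8_t cpu_read(uint16_t addr)
{
    ReadPage *p = &read_map[addr >> PAGE_SHIFT];
    if (p->base)                                          /* common case:  */
        return p->base[addr & ((1 << PAGE_SHIFT) - 1)];   /* plain memory  */
    return p->handler(addr);                              /* HW register   */
}
####################################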
- Detecting when the PC exceeds a page/bank boundary. Because of the way IPC is implemented, when IPC crosses a page boundary, instead of the next virtual page being executed, the next physical page is executed. This will result in the CPU exhibiting behaviour the game never intended, and may compromise the stability of the game being emulated. This problem cannot be overcome, but it can be trapped. By placing 3 of the same special opcode at the end of these banks in physical memory, the page-boundary-crossing condition can be detected. Now, this doesn't give much hope towards running games that may potentially do this, but if a game did this anyway, it would probably just hang without much explanation. By trapping this occurrence, you'll know exactly what happened.

EOF